Let’s reconsider the cross-sectional wage data from the 1976 US Current Population Survey (Wooldridge, 2000) that we explored last time. It consists of 526 observations of average hourly wages, along with covariates such as education, race, gender, and marital status.
# uncomment the install.packages if you do not have this package
# install.packages('wooldridge')
require('wooldridge')
data('wage1')
head(wage1)
##   wage educ exper tenure nonwhite female married numdep smsa northcen
## 1 3.10   11     2      0        0      1       0      2    1        0
## 2 3.24   12    22      2        0      1       1      3    1        0
## 3 3.00   11     2      0        0      0       0      2    0        0
## 4 6.00    8    44     28        0      0       1      0    1        0
## 5 5.30   12     7      2        0      0       1      1    0        0
## 6 8.75   16     9      8        0      0       1      0    1        0
##   south west construc ndurman trcommpu trade services profserv profocc
## 1     0    1        0       0        0     0        0        0       0
## 2     0    1        0       0        0     0        1        0       0
## 3     0    1        0       0        0     1        0        0       0
## 4     0    1        0       0        0     0        0        0       0
## 5     0    1        0       0        0     0        0        0       0
## 6     0    1        0       0        0     0        0        1       1
##   clerocc servocc    lwage expersq tenursq
## 1       0       0 1.131402       4       0
## 2       0       1 1.175573     484       4
## 3       0       0 1.098612       4       0
## 4       1       0 1.791759    1936     784
## 5       0       0 1.667707      49       4
## 6       0       0 2.169054      81      64
As a refresher, let’s look at the distribution of wages, in dollars per hour. Recall that if we subset the data by gender, we get two histograms that look different. This suggests that gender likely has an effect on wage, and we should include it in our regression analysis.
library(dplyr)
library(ggplot2)
# A histogram of wages
wage1 %>% ggplot() +
  geom_histogram(aes(x = wage),
                 fill = 'light blue',
                 color = 'blue')
# we can color by gender, for example
# ggplot is great for making clear labels and customizing plots with pretty colors
wage1 %>% mutate(gender = as.factor(female)) %>%
  ggplot() +
  geom_histogram(aes(x = wage, fill = gender), position = "dodge") +
  scale_fill_manual(labels = c("Men", "Women"), values = c("seagreen", "orchid"))
# we could also just look at simple boxplots:
ggplot(wage1) + geom_boxplot(aes(x = factor(female), y = log(wage))) +
  labs(x = "Gender") + scale_x_discrete(labels = c("Men", "Women"))
In the previous lab, we ignored the gender variable and predicted wage only by education. Why might this have been a problem?
Multiple linear regression allows us to include multiple variables in our linear model. But which variables should we choose? This is a difficult question that we will discuss later in the course. For now, let’s visually explore the effect of some variables.
Pick a few variables that you think might affect wages, and make some plots that would visually display the effect of this variable on wage. (Hint for making better plots: last time, we discussed that taking the log of wage was a good idea.)
The variables in the dataset are listed in the head(wage1) output above; run ?wage1 for a description of each.
# Make plots examining the effect of some other variables (besides education) on wage
# Example 1: Scatterplots (for continuous variables)
ggplot(wage1) + geom_point(aes(x=exper, y = log(wage)))
ggplot(wage1) + geom_point(aes(x=tenure, y = log(wage)))
# Example 2: Boxplots (for categorical variables)
ggplot(wage1) + geom_boxplot(aes(x=factor(married), y = log(wage)))
ggplot(wage1) + geom_boxplot(aes(x=factor(profocc), y = log(wage)))
Let’s create a model that examines the effects of gender, education, and tenure on (log) wages. What is the regression model? Fit this model and print the coefficients:
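In symbols, the model we are about to fit is \[\log(wage_i) = \beta_0 + \beta_1 \cdot female_i + \beta_2 \cdot educ_i + \beta_3 \cdot tenure_i + \varepsilon_i,\] where \(\varepsilon_i\) is a mean-zero error term.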
# fit a multiple linear regression of wage on gender, education, and tenure:
model <- lm(log(wage) ~ female + educ + tenure, data = wage1)
summary(model)
## 
## Call:
## lm(formula = log(wage) ~ female + educ + tenure, data = wage1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.96883 -0.25262 -0.03383  0.24687  1.29983 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.633125   0.091382   6.928 1.26e-11 ***
## female      -0.297052   0.037470  -7.928 1.36e-14 ***
## educ         0.081354   0.006643  12.246  < 2e-16 ***
## tenure       0.021634   0.002588   8.359 5.78e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4188 on 522 degrees of freedom
## Multiple R-squared:  0.3828, Adjusted R-squared:  0.3793 
## F-statistic: 107.9 on 3 and 522 DF,  p-value: < 2.2e-16
What are the estimated coefficients? What do they mean? How is the interpretation different from simple linear regression?
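One way to make the log-wage coefficients concrete: because the response is log(wage), a coefficient \(\hat{\beta}_j\) corresponds to a \((e^{\hat{\beta}_j} - 1) \times 100\) percent change in wage for a one-unit change in that variable, holding the others fixed. A quick sketch, using the estimates copied from the summary output above:

```r
# convert log-wage coefficients to exact percent changes in wage
# (values copied from the summary output above)
b <- c(female = -0.297052, educ = 0.081354, tenure = 0.021634)
round((exp(b) - 1) * 100, 1)
# female: about -25.7%; educ: about +8.5% per year; tenure: about +2.2% per year
```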
Gender is a categorical variable: it is not a numeric observation like years of education. How can we use this variable in a linear model?
A dummy variable, or indicator variable, is used to represent the levels of a categorical variable. In this dataset, we can see that the gender variable has already been transformed into a dummy variable:
head(wage1)
##   wage educ exper tenure nonwhite female married numdep smsa northcen
## 1 3.10   11     2      0        0      1       0      2    1        0
## 2 3.24   12    22      2        0      1       1      3    1        0
## 3 3.00   11     2      0        0      0       0      2    0        0
## 4 6.00    8    44     28        0      0       1      0    1        0
## 5 5.30   12     7      2        0      0       1      1    0        0
## 6 8.75   16     9      8        0      0       1      0    1        0
##   south west construc ndurman trcommpu trade services profserv profocc
## 1     0    1        0       0        0     0        0        0       0
## 2     0    1        0       0        0     0        1        0       0
## 3     0    1        0       0        0     1        0        0       0
## 4     0    1        0       0        0     0        0        0       0
## 5     0    1        0       0        0     0        0        0       0
## 6     0    1        0       0        0     0        0        1       1
##   clerocc servocc    lwage expersq tenursq
## 1       0       0 1.131402       4       0
## 2       0       1 1.175573     484       4
## 3       0       0 1.098612       4       0
## 4       1       0 1.791759    1936     784
## 5       0       0 1.667707      49       4
## 6       0       0 2.169054      81      64
table(wage1$female)
## 
##   0   1 
## 274 252
In R, we could fit a model by first transforming any categorical variable into a dummy variable, like the wage dataset has done. This is not necessary, however. We could just as easily use a factor variable in the lm() function.
# let's add a categorical variable to indicate gender rather than using the dummy variable
wage1$gender <- wage1$female
wage1$gender[wage1$gender == 1] <- "women"
wage1$gender[wage1$gender == 0] <- "men"
wage1$gender <- factor(wage1$gender) # now gender is stored as a factor variable
table(wage1$gender)
## 
##   men women 
##   274   252
lm(log(wage) ~ educ + gender, data = wage1) # factor variable
##
## Call:
## lm(formula = log(wage) ~ educ + gender, data = wage1)
##
## Coefficients:
## (Intercept)         educ  genderwomen  
##      0.8263       0.0772      -0.3609  
lm(log(wage) ~ educ + female, data = wage1) # dummy variable
##
## Call:
## lm(formula = log(wage) ~ educ + female, data = wage1)
##
## Coefficients:
## (Intercept)        educ      female  
##      0.8263      0.0772     -0.3609  
So if we give lm() a categorical variable, it will convert it into a dummy variable behind the scenes so that it can carry out the numerical work of fitting the model.
What happens if we have more than two categories for a categorical variable? For example, we might be interested in how the region of the country affects wages. We could collect this information in our dataset with a variable region with categories north/central, south, west, and other. How can we make a dummy variable for this variable?
One idea might be to just use values of \(0, 1, 2, 3\) to represent the different regions. Are there problems with this approach?
The problem is that coding the regions as \(0, 1, 2, 3\) imposes both an ordering and a spacing on the categories: with a single slope coefficient, the model is forced to estimate a region effect that moves in equal steps across the regions. For example, the fitted difference between the regions coded 0 and 2 must be exactly twice the difference between the regions coded 0 and 1. We do not want to impose this sort of structure on an unordered variable.
The solution is to create multiple binary dummy variables: specifically, one dummy variable for whether or not the region is “south”, another dummy variable for whether or not the region is “west”, and one dummy variable for whether or not the region is “north/central”. Note that we only need 3 dummy variables. (Why?)
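To see why a fourth dummy would be redundant, note that the four region indicators always sum to one, which is exactly the intercept column. Here is a small sketch with made-up region labels (not the wage data) that shows the resulting collinearity:

```r
# a toy region factor, just to illustrate the dummy-variable trap
region <- factor(c("other", "northcen", "south", "west", "south", "other"))
D <- model.matrix(~ 0 + region)   # one indicator column per level (4 dummies)
rowSums(D)                        # every row sums to 1, i.e. the intercept column
X_bad <- cbind(1, D)              # intercept plus all 4 dummies
qr(X_bad)$rank                    # rank 4, not 5: the columns are collinear
```

Because of this, lm() drops one level (the "reference" category) when it builds dummies.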
The wage dataset is actually already set up with dummy variables. Let’s imagine that we had a categorical variable region to see how the dummy variables correspond to the region variable.
# make a categorical variable region
wage1$region <- "other"
wage1$region[wage1$northcen == 1] <- "northcen"
wage1$region[wage1$south == 1] <- "south"
wage1$region[wage1$west == 1] <- "west"
wage1$region <- factor(wage1$region, levels = c("other", "northcen", "south", "west"))
# recording the data as a categorical variable makes plotting multiple categories simpler
ggplot(wage1) + geom_boxplot(aes(x = region, y = log(wage)))
# take a look at how the dummy variables correspond to the region variable
wage1[490:515,c("wage", "northcen", "south", "west", "region")]
##      wage northcen south west   region
## 490  2.90        0     1    0    south
## 491  6.25        0     1    0    south
## 492  4.55        0     1    0    south
## 493  3.28        0     1    0    south
## 494  2.30        1     0    0 northcen
## 495  3.30        1     0    0 northcen
## 496  3.15        1     0    0 northcen
## 497 12.50        1     0    0 northcen
## 498  5.15        0     0    1     west
## 499  3.13        0     0    1     west
## 500  7.25        0     0    1     west
## 501  2.90        0     0    1     west
## 502  1.75        0     0    1     west
## 503  2.89        0     0    1     west
## 504  2.90        0     0    1     west
## 505 17.71        0     0    1     west
## 506  6.25        0     0    1     west
## 507  2.60        0     0    1     west
## 508  6.63        0     0    1     west
## 509  3.50        0     0    1     west
## 510  6.50        0     0    1     west
## 511  3.00        0     0    1     west
## 512  4.38        0     0    1     west
## 513 10.00        0     0    0    other
## 514  4.95        0     0    0    other
## 515  9.00        0     0    0    other
What do the dummy variables mean here? How do they correspond to the region variable?
Now if we wanted to include region in our model, we could use the region variable, and the lm() function would create its own dummy variables, just like the wage dataset has already done. Fit a linear model of log wages on education and region. Compare the results when you give lm() region as a categorical variable versus when you give lm() the dummy variables. Are they the same?
# fit a linear model of log wages on education and region,
# first by using the factor variable region:
lm(log(wage) ~ educ + region, data = wage1)
##
## Call:
## lm(formula = log(wage) ~ educ + region, data = wage1)
##
## Coefficients:
##    (Intercept)            educ  regionnorthcen     regionsouth  
##        0.62310         0.08210        -0.06986        -0.06023  
##     regionwest  
##        0.04562  
# next by using the dummy variables for region:
lm(log(wage) ~ educ + northcen + south + west, data = wage1)
##
## Call:
## lm(formula = log(wage) ~ educ + northcen + south + west, data = wage1)
##
## Coefficients:
## (Intercept)        educ    northcen       south        west  
##     0.62310     0.08210    -0.06986    -0.06023     0.04562  
# in both cases, R constructs the design matrix in the same way:
head(model.matrix(~educ + region, data = wage1))
##   (Intercept) educ regionnorthcen regionsouth regionwest
## 1           1   11              0           0          1
## 2           1   12              0           0          1
## 3           1   11              0           0          1
## 4           1    8              0           0          1
## 5           1   12              0           0          1
## 6           1   16              0           0          1
head(model.matrix(~educ + northcen + south + west, data = wage1))
##   (Intercept) educ northcen south west
## 1           1   11        0     0    1
## 2           1   12        0     0    1
## 3           1   11        0     0    1
## 4           1    8        0     0    1
## 5           1   12        0     0    1
## 6           1   16        0     0    1
In summary, if we give the lm() function a categorical variable, it will create dummy variables as shown above to represent the categorical variable.
How is lm() computing these coefficients? Let’s check that R’s output matches the mathematical formula for computing the coefficients that was discussed in lecture. Compute the coefficients of our original model (log(wage) ~ female + educ + tenure) without using lm(). (Hint: set up the X matrix first; then use the formula from class.)
# compute the regression coefficients without using lm()
X <- cbind(1, wage1[, c("female", "educ", "tenure")])
X <- as.matrix(X)
head(X)
##   1 female educ tenure
## 1 1      1   11      0
## 2 1      1   12      2
## 3 1      0   11      0
## 4 1      0    8     28
## 5 1      0   12      2
## 6 1      0   16      8
y <- log(wage1$wage)
beta_hat <- solve(t(X) %*% X, t(X) %*% y) # normal equations from lecture
# solve(A, b) takes 2 arguments, the left and right hand sides of a linear system,
# and solves A x = b for x
# verify that the computed coefficients match the lm() output
beta_hat
##               [,1]
## 1       0.63312533
## female -0.29705197
## educ    0.08135374
## tenure  0.02163383
model$coefficients
## (Intercept)      female        educ      tenure 
##  0.63312533 -0.29705197  0.08135374  0.02163383
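As an aside, solving the normal equations this way can be numerically delicate when X is badly conditioned; lm() instead works with a QR decomposition of X. We can mimic that directly with qr.solve, which computes the same least-squares solution (a sketch, reusing the X and y defined above):

```r
# least squares via a QR decomposition of X, as lm() does internally
beta_hat_qr <- qr.solve(X, y)
cbind(beta_hat, beta_hat_qr)  # the two methods agree
```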
It’s always a good idea to look at residuals. Compute and plot the residuals here:
# Compute residuals
residuals <- model$residuals
predicted <- model$fitted.values
# Plot residuals
par(mfrow = c(1, 1))
hist(residuals, breaks = 15)
plot(predicted, residuals)
abline(0, 0, col = "red")
# Plot residuals against x variables
plot(residuals ~ wage1$educ)
abline(0,0, col = "red")
plot(residuals ~ wage1$tenure)
abline(0,0, col = "red")
Observations about residuals?
We might be interested in visualizing the results of multiple regression. In simple regression, since we only have one response variable and one predictor variable, we can easily visualize this with a scatterplot and fitted line. Can we still do this with multiple regression? Try this if you’d like; discuss any problems.
# visualize the results of our multiple regression model
plot(log(wage) ~ educ, data = wage1)
abline(model, col = "red")
Clearly this is not what we were hoping for. What went wrong?
To see what’s happening, let’s try a simpler model, using only gender and education as explanatory variables.
model2 <- lm(log(wage) ~ educ + female, data = wage1)
coefs <- model2$coefficients
coefs
## (Intercept)        educ      female 
##  0.82626936  0.07720333 -0.36086544
Here we can interpret the coefficients as being a slope and intercept if we condition on the gender variable. For men, the value of the female variable is 0, so their regression line is \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot educ + \hat{\beta}_2 \cdot 0 = \hat{\beta}_0 + \hat{\beta}_1 \cdot educ. \] For women, the value of the female variable is 1, so their regression line is \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot educ + \hat{\beta}_2 \cdot 1 = (\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1 \cdot educ.\]
coplot(log(wage)~educ|factor(female), data = wage1)
women_wage <- subset(wage1, female == 1)
men_wage <- subset(wage1, female == 0)
par(mfrow = c(1, 2))
plot(log(wage) ~ educ, data = men_wage, main = "Wages by Education for Men",
     ylim = c(0, 4), pch = 20)
abline(coef = coefs[1:2], col = "blue")
plot(log(wage) ~ educ, data = women_wage, main = "Wages by Education for Women",
     ylim = c(0, 4), pch = 20)
abline(coef = c(coefs[1] + coefs[3], coefs[2]), col = "red")
Essentially, the multiple regression changes the intercept for men and women: the model says that wages for men are shifted higher overall. The slope on education stays the same, though. If we thought that the slopes might differ between men and women, we would need to add an interaction term, \(educ \times female\), to our model.
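For reference, an interaction model of this kind could be fit as follows (a sketch; in R, educ*female expands to educ + female + educ:female, so each gender gets its own intercept and its own education slope):

```r
# allow both the intercept and the education slope to differ by gender
model_int <- lm(log(wage) ~ educ * female, data = wage1)
coef(model_int)  # the educ:female coefficient is the difference in slopes for women
```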
In the last lab, we verified that regression coefficients were unbiased in the simple linear regression setup. Let’s now check that coefficients are also unbiased in the multiple regression setting.
# True parameters (unknown)
beta0 <- 32
beta1 <- 0.5
beta2 <- 1.2
Let’s draw data according to the multiple linear regression model. What is the generative model?
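In formulas, the code below draws \[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2),\] with \(\beta_0 = 32\), \(\beta_1 = 0.5\), \(\beta_2 = 1.2\), and \(\sigma = 6\), and with the \(x\) values held fixed across repetitions of the experiment.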
#################
# draw data according to the linear regression model.
#################
n_obs <- 100 # number of observations
sig <- 6 # std error of the y's
x1 <- runif(n_obs, 0, 10) # draw x's: they will be fixed for our experiment
x2 <- runif(n_obs, 0, 10)
draw_data <- function(x1, x2, beta0, beta1, beta2, sig){
  n_obs <- length(x1)
  # draw y's
  y <- rnorm(n_obs, mean = beta0 + beta1*x1 + beta2*x2, sd = sig)
  return(data.frame(x1 = x1, x2 = x2, y = y))
}
dataset <- draw_data(x1, x2, beta0, beta1, beta2, sig)
head(dataset)
##         x1       x2        y
## 1 1.954497 1.929700 40.27170
## 2 7.627452 7.502178 55.13422
## 3 8.932281 8.804351 55.88279
## 4 9.793232 4.839827 40.71721
## 5 5.653229 5.283787 31.72953
## 6 7.092454 4.303021 42.70961
Let’s compute the regression coefficients using our simulated dataset:
# Compute the regression coefficients from simulated data:
lm(y ~ x1 + x2, data = dataset)
##
## Call:
## lm(formula = y ~ x1 + x2, data = dataset)
##
## Coefficients:
## (Intercept)           x1           x2  
##     30.3414       0.5855       1.6137  
Is it unbiased? We can run a simulation to check:
# draw multiple datasets from the model
# and for each dataset, recompute the regression coefficients.
n_trials <- 200
intercepts <- rep(0, n_trials)
X1_slopes <- rep(0, n_trials)
X2_slopes <- rep(0, n_trials)
for(i in 1:n_trials){
  dataset <- draw_data(x1, x2, beta0, beta1, beta2, sig)
  mlm <- lm(y ~ x1 + x2, data = dataset)
  intercepts[i] <- mlm$coefficients[1]
  X1_slopes[i] <- mlm$coefficients[2]
  X2_slopes[i] <- mlm$coefficients[3]
}
What are the distributions of our estimated intercepts and slopes?
# examine distribution of the intercepts
hist(intercepts)
abline(v = beta0, col = "red")
# examine distribution of the slopes
hist(X1_slopes)
abline(v=beta1, col = "red")
hist(X2_slopes)
abline(v=beta2, col = "red")
Do our estimates appear unbiased?
Is multiple regression really necessary? Let’s see what happens when we get the model wrong.
lm(y~x1, data = dataset)
##
## Call:
## lm(formula = y ~ x1, data = dataset)
##
## Coefficients:
## (Intercept)           x1  
##     36.5333       0.9447  
Run the simulation again, but this time fit a model only on x1. Examine the distribution of the coefficients.
# run the simulation again, still drawing data from the model y~x1+x2,
# but this time fit a model only on x1
slm_intercepts <- rep(0, n_trials)
slm_slopes <- rep(0, n_trials)
for(i in 1:n_trials){
  dataset <- draw_data(x1, x2, beta0, beta1, beta2, sig)
  slm <- lm(y ~ x1, data = dataset)
  slm_intercepts[i] <- slm$coefficients[1]
  slm_slopes[i] <- slm$coefficients[2]
}
# see what happens to the coefficients
hist(slm_intercepts, breaks = 15, xlim = c(30, 45))
abline(v = beta0, col = "red")
hist(slm_slopes, breaks = 15)
abline(v = beta1, col = "red")
Are the coefficients still unbiased if we get the model wrong (if we don’t include x2 in the model)?
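What is happening here can be made precise with the omitted-variable-bias formula: the short-regression slope on x1 equals the long-regression slope on x1 plus the x2 slope times the slope from regressing x2 on x1, so the bias vanishes only if x1 and x2 are uncorrelated in the sample. A self-contained sketch with made-up, deliberately correlated predictors:

```r
# omitted-variable bias: short slope = long slope + (x2 slope) * (slope of x2 on x1)
set.seed(42)
x1 <- runif(200, 0, 10)
x2 <- x1 + rnorm(200, sd = 2)    # x2 correlated with x1 by construction
y  <- 32 + 0.5 * x1 + 1.2 * x2 + rnorm(200, sd = 6)

b_long  <- coef(lm(y ~ x1 + x2))  # model including x2
b_short <- coef(lm(y ~ x1))       # model omitting x2
delta   <- coef(lm(x2 ~ x1))[2]   # slope of x2 on x1

# the identity holds exactly for the fitted coefficients:
b_short[2]
b_long[2] + b_long[3] * delta
```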
We can work through any problems you might have encountered. Key ideas from homework 1 were simple linear regression, interpreting coefficients, logs of variables, predictions, unbiasedness, and homoskedasticity.
Wooldridge, J. M. (2000), Introductory Econometrics: A Modern Approach, South-Western College Publishing.